Aki Shiroshita (Epidemiology PhD student, akihiro.shiroshita@vanderbilt.edu) developed a tailored version of DeGAUSS specifically for the EV project.
About Original DeGAUSS
DeGAUSS (https://degauss.org/) is designed to derive environmental variables while preserving the privacy of protected health information (PHI). It uses Docker images to process address data: users provide a CSV file containing address information and receive an output file with various environmental variables.
Limitations of Original DeGAUSS
The original DeGAUSS has limited flexibility:
- It is not optimized for very large datasets.
- It requires input and output files to be stored in the same folder.
- It does not support the creation of custom variables.
Improvements in the Modified DeGAUSS
- Avoids reliance on Docker; Podman is used only for part of the process.
- Uses C and C++ in the backend wherever possible.
- Enables parallel processing across multiple cores.
- Restricts environmental data to Tennessee rather than the entire U.S.
- Provides clean, processed output files with all PHI removed.
What can we get through Modified DeGAUSS?
| Category | Variable name | Data source | Description |
|---|---|---|---|
| Parsing and normalizing | address | libpostal | cleaned address |
| Geocoding | lon | TIGER/Line Street Range Address | longitude and latitude |
| | lat | TIGER/Line Street Range Address | longitude and latitude |
| Road proximity | dist_to_1100 | U.S. Census Bureau | distance (meters) to the nearest S1100 road |
| | dist_to_1200 | U.S. Census Bureau | distance (meters) to the nearest S1200 road |
| | length_1100 | U.S. Census Bureau | length (meters) of S1100 roads within a 400 m buffer |
| | length_1200 | U.S. Census Bureau | length (meters) of S1200 roads within a 400 m buffer |
| Traffic density | length_moving | U.S. Department of Transportation Federal Highway Administration | total length of interstates, expressways, and freeways (meters) |
| | length_stop_go | U.S. Department of Transportation Federal Highway Administration | total length of arterial roads (meters) |
| | vehicle_meters_moving | U.S. Department of Transportation Federal Highway Administration | average daily number of vehicles multiplied by the length of interstates, expressways, and freeways (vehicle-meters) |
| | vehicle_meters_stop_go | U.S. Department of Transportation Federal Highway Administration | average daily number of vehicles multiplied by the length of arterial roads (vehicle-meters) |
| | truck_meters_moving | U.S. Department of Transportation Federal Highway Administration | average daily number of trucks multiplied by the length of interstates, expressways, and freeways (truck-meters) |
| | truck_meters_stop_go | U.S. Department of Transportation Federal Highway Administration | average daily number of trucks multiplied by the length of arterial roads (truck-meters) |
| New road proximity and traffic density | dist_near | U.S. Department of Transportation Federal Highway Administration | distance (meters) to the nearest interstates, expressways, or freeways |
| | aadt_near | U.S. Department of Transportation Federal Highway Administration | average daily number of vehicles of the nearest interstates, expressways, or freeways |
| Redlining categories | redlining | Mapping Inequality | Historic HOLC classifications (A, B, C, and D) |
| Greenspace | evi_500 | LP DAAC MOD13Q1 | average enhanced vegetation index within a 500 meter buffer radius |
| | evi_1500 | LP DAAC MOD13Q1 | average enhanced vegetation index within a 1500 meter buffer radius |
| | evi_2500 | LP DAAC MOD13Q1 | average enhanced vegetation index within a 2500 meter buffer radius |
| Deprivation score | fraction_assisted_incom | 2018 American Community Survey | fraction of households receiving public assistance income or food stamps or SNAP in the past 12 months |
| | fraction_high_school_edu | 2018 American Community Survey | fraction of population 25 and older with educational attainment of at least high school graduation (includes GED equivalency) |
| | median_income | 2018 American Community Survey | median household income in the past 12 months in 2018 inflation-adjusted dollars |
| | fraction_no_health_ins | 2018 American Community Survey | fraction of population with no health insurance coverage |
| | fraction_poverty | 2018 American Community Survey | fraction of population with income in past 12 months below poverty level |
| | fraction_vacant_housing | 2018 American Community Survey | fraction of houses that are vacant |
| | dep_index | 2018 American Community Survey | composite measure of the 6 variables above |
| Air pollutants | average_no2_infancy | Original Schwartz model | Average daily NO2 levels during infancy |
| | average_bc_infancy | Provided by Kai Zhang | Average monthly black carbon levels during infancy |
In addition, the output includes:
- Number of children per census block (blocks with fewer than 11 children are masked as “<11”).
- Indicator of whether each child experienced relocation and changes in environmental exposures (e.g., moving closer to or farther from major roads).
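The small-cell masking rule for the per-block child counts (blocks with fewer than 11 children shown as "<11") can be illustrated with a minimal Python sketch. This is not the code used in the modified DeGAUSS; the function name `tabulate_children_per_block`, the `threshold` parameter, and the block IDs are hypothetical and serve only to show the rule.

```python
from collections import Counter


def tabulate_children_per_block(block_ids, threshold=11):
    """Count children per census block and mask small cells.

    block_ids: one block identifier per child (hypothetical input format).
    Counts below `threshold` are reported as "<threshold" instead of the
    exact number; all values are returned as strings, as in a CSV column.
    """
    counts = Counter(block_ids)
    return {
        block: (str(n) if n >= threshold else f"<{threshold}")
        for block, n in sorted(counts.items())
    }


# Hypothetical example: 12 children in block "B001", 3 in block "B002".
example_ids = ["B001"] * 12 + ["B002"] * 3
table = tabulate_children_per_block(example_ids)
print(table)
```

Masking is applied after counting, so the exact count for a small block never appears in the shared table.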
How to Use Modified DeGAUSS
The environment has already been set up for you. All you need to do is follow the instructions.
Step-by-Step Instructions
- Locate the Folder:
Navigate to the folder “C:_degauss_2025_08_14” on the Windows server (Cqshealth.dhcp.mc.vanderbilt.edu).
- Open R Project:
Launch RStudio.
Note: It may take 1–2 minutes to open, as the RStudio settings have been customized for this project. Each time you start the program, please wait until items appear in the environment.
- Start Podman:
Open the Command Prompt and run: podman machine start
- Run R Script:
Open the file test.R.
Execute the script section by section using the shortcut:
Place your cursor in the section and press Ctrl + Alt + T.
- Locate Output Files:
Processed data will be saved in the output folder of your choice.
This folder will contain CSV files, including tract.csv
(used for the subject selection flow), final_data.csv (the
final dataset for sharing with other researchers, with all PHI removed),
tab_census.csv (census tract tabulation data), and
tab_relocation.csv (relocation information).
Specific instructions for Huiping
- Map your shared drive containing address information to our Windows server
Note: Your data will remain on the shared drive and will never leave the VUMC environment. The server will load data into memory for processing, but data will not be stored in local server folders. Any temporary cache generated during processing will be automatically removed.
- Folder choice
Could you provide the path to the input folder containing the address data and the file name?
What is the path to the output folder where you’d like to store the processed data after all PHI has been removed?
If you would like to create a temporary folder in a different location to store intermediate files containing PHI, please specify the path.
Defining start date and end date
To define the start date and end date, we
need to merge any overlapping or adjacent enrollment periods into single,
continuous time spans. This ensures there are no artificial gaps in the
timeline.
The TennCare enrollment file should look like this:
| recip | enrol_begin_date | enrol_end_date | address |
|---|---|---|---|
| 1 | 2023-01-01 | 2024-01-02 | 123 Main St |
| 1 | 2024-01-02 | 2025-03-02 | 456 Elm St |
| 2 | 2022-01-02 | 2023-01-02 | 789 Oak St |
not like this:
| recip | registration_date | address |
|---|---|---|
| 1 | 2023-01-01 | 123 Main St |
| 1 | 2024-01-02 | 456 Elm St |
| 2 | 2022-01-02 | 789 Oak St |
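The merging of overlapping or adjacent enrollment periods described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in the modified DeGAUSS: the function names `merge_periods` and `merge_by_recipient` are hypothetical, and "adjacent" is assumed to mean that the next period begins no later than one day after the previous one ends (the `max_gap_days` parameter). The sample rows reuse the dates from the first example table; addresses are left out because only the dates are merged.

```python
from collections import defaultdict
from datetime import date, timedelta


def merge_periods(periods, max_gap_days=1):
    """Merge overlapping or adjacent (begin, end) spans into continuous spans.

    Assumption: two spans are 'adjacent' when the later one begins at most
    `max_gap_days` days after the earlier one ends.
    """
    merged = []
    for begin, end in sorted(periods):
        if merged and begin <= merged[-1][1] + timedelta(days=max_gap_days):
            # Overlapping or adjacent: extend the current span if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # A real gap: start a new span.
            merged.append((begin, end))
    return merged


def merge_by_recipient(rows, max_gap_days=1):
    """Group (recip, begin, end) rows by recipient and merge each group."""
    by_recip = defaultdict(list)
    for recip, begin, end in rows:
        by_recip[recip].append((begin, end))
    return {r: merge_periods(p, max_gap_days) for r, p in by_recip.items()}


rows = [
    (1, date(2023, 1, 1), date(2024, 1, 2)),
    (1, date(2024, 1, 2), date(2025, 3, 2)),
    (2, date(2022, 1, 2), date(2023, 1, 2)),
]
spans = merge_by_recipient(rows)
for recip, periods in spans.items():
    for begin, end in periods:
        print(recip, begin.isoformat(), end.isoformat())
```

A file that contains only a registration date per row, as in the second table, cannot be processed this way because no end date is available to decide whether two periods overlap.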
Delete modified DeGAUSS
Once all processes are completed and the required outputs are finalized, I will delete the modified DeGAUSS from the server.